22. Applying Probes and Concepts

  • application to case study

  • testing concept vectors

  • relationship with sufficiency

  • user study evaluation

Learning Outcomes

  • Describe the theoretical foundation of post-hoc explanation methods such as SHAP values and linear probes, and apply them to realistic case studies with appropriate validation checks.
  • Analyze large-scale foundation models using methods like sparse autoencoders and describe their relevance to problems of model control and AI-driven design.
  • Within a specific application context, evaluate the trade-offs associated with competing interpretable machine learning techniques.

Readings

Alain, G., & Bengio, Y. (2017). Understanding intermediate layers using linear classifier probes. OpenReview. https://openreview.net/forum?id=ryF7rTqgl

Schmalwasser, L., Penzel, N., Denzler, J., & Niebling, J. (2025). FastCAV: Efficient computation of concept activation vectors for explaining deep neural networks. In Proceedings of the 42nd International Conference on Machine Learning. https://openreview.net/forum?id=kRmfzTfIGe

Extracting activations

In PyTorch, use register_forward_hook to capture \(h_l(x)\) during the forward pass.

import torch
import torchvision

activations = {}

def activation_fun(name):
    def hook(model, input, output):
        activations[name] = output.detach()
    return hook

# Register hooks on layers of interest
model = torchvision.models.resnet50(pretrained=True).eval()
model.layer3.register_forward_hook(activation_fun('layer3'))

# Run one forward pass to populate activations['layer3']
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

Testing CAVs

Spurious Concepts: Why Test?

  • High-dimensional spaces (\(d=2048\)): almost any two 50-image sets are separable. Random directions may align with gradients by chance.

  • Need to calibrate. Is TCAV=0.6 high or just noise?
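This separability pitfall is easy to demonstrate. The sketch below (illustrative, using scikit-learn; sizes chosen to mirror the \(d=2048\), 50-image setting) fits a linear classifier to two sets of purely random points:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two "concept" sets of 50 points each, drawn from pure noise in d=2048
rng = np.random.default_rng(0)
d = 2048
pos = rng.standard_normal((50, d))  # stand-in for concept activations
neg = rng.standard_normal((50, d))  # stand-in for random activations

X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

# When n << d, a linear separator fits noise essentially perfectly,
# so the mere existence of a CAV says nothing about the concept
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # near-perfect training accuracy on pure noise
```

This is why a raw TCAV score needs calibration against a null distribution before it means anything.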

Step 4. Let \(N_k\) be the number of images in class \(k\), and let \(S_k(x)\) denote the concept sensitivity of input \(x\) (the directional derivative of the class-\(k\) logit along the CAV). The TCAV Score (\(T_k\)) is:

\[T_k = \frac{1}{N_k}\left|\{x \in \text{Class } k : S_k(x) > 0\}\right|\]

  • \(T_k > 0.5\): concept generally increases prediction
  • \(T_k < 0.5\): concept generally decreases prediction
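Concretely, \(T_k\) is just the fraction of sensitivities that are positive. A minimal sketch (the sensitivity values are hypothetical):

```python
import numpy as np

def tcav_score(sensitivities):
    """TCAV score T_k: fraction of class-k inputs with positive
    concept sensitivity S_k(x)."""
    s = np.asarray(sensitivities, dtype=float)
    return float((s > 0).mean())

# Hypothetical sensitivities for N_k = 8 images of class k
print(tcav_score([0.4, 1.2, -0.1, 0.9, 0.3, 0.7, -0.2, 0.5]))  # → 0.75
```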


Concept: “stripes”

Step 5. Test significance by learning reference CAVs \(\vec{v}\) trained on random image sets.

  • Run CAV process 50+ times with different random negative sets
  • Generate null distribution of TCAV scores
  • Two-sided t-test: is concept’s \(T_k\) statistically significant (\(p < 0.05\))?
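The significance test above can be sketched with SciPy; the score values here are fabricated stand-ins for the per-run TCAV scores, not real results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-ins for 50 reruns of the CAV process with different random sets:
concept_scores = rng.normal(0.82, 0.04, size=50)  # "stripes" vs. random
null_scores = rng.normal(0.50, 0.05, size=50)     # random vs. random

# Two-sided t-test: does the concept's T_k distribution differ from null?
t_stat, p_value = stats.ttest_ind(concept_scores, null_scores)
print(p_value < 0.05)  # significant -> the concept direction is not noise
```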



Real-World Example: Medical Imaging

  • Model. Melanoma detector
  • TCAV audit. High sensitivity to “ruler markings”
  • Conclusion. Spurious correlation (rulers mark dangerous lesions)

Concepts data structure

We need three things.

  1. Concept Sets. Folders of images representing “trees,” “water,” “parking,” etc.
  2. Random Pool. Large, diverse set of images (e.g., ImageNet) for null hypothesis
  3. Layer Activations. Dictionary mapping layer_name \(\to\) tensor(N, dim)
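A minimal sketch of these three ingredients as plain Python structures (all paths and sizes are illustrative):

```python
from pathlib import Path
import torch

# 1. Concept sets: one folder of example images per concept
concept_dirs = {name: Path("concepts") / name
                for name in ["trees", "water", "parking"]}

# 2. Random pool: large, diverse image folder for the null hypothesis
random_pool = Path("concepts/random_imagenet")

# 3. Layer activations: layer_name -> tensor of shape (N, dim)
layer_activations = {
    "layer3": torch.zeros(100, 1024),  # placeholders: N=100 images
    "layer4": torch.zeros(100, 2048),
}

for name, acts in layer_activations.items():
    print(name, tuple(acts.shape))
```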

Concepts data structure

from captum.concept import TCAV, Concept
from captum.attr import LayerIntegratedGradients

# 1. load model and specify layers
model = torchvision.models.googlenet(pretrained=True).eval()
layers = ['inception4c', 'inception4d', 'inception4e']

# 2. folders contain example images
stripes = Concept(id=0, name="striped", data_iter=load_concept("striped/"))
random = Concept(id=1, name="random", data_iter=load_concept("random/"))

Estimating concept directions

# 3. initialize tcav class
tcav = TCAV(model=model, layers=layers,
            layer_attr_method=LayerIntegratedGradients(model, None))

# 4. TCAV test: does "stripes" influence "zebra" predictions?
scores = tcav.interpret(inputs=zebra_images,
                        experimental_sets=[[stripes, random]],
                        target=340)  # zebra class

FastCAV

from fastcav import FastCAVCaptumClassifier

tcav = TCAV(model=model, layers=layers,
            classifier=FastCAVCaptumClassifier(),  # <-- only change
            layer_attr_method=LayerIntegratedGradients(model, None))

# Same workflow, identical results, faster
scores = tcav.interpret(inputs=zebra_images, experimental_sets=[[stripes, random]], target=340)

Outputs

TCAV Score. Nested dictionary structure

tcav_scores = {
    '0-1': {  # key pairs the concept IDs being compared
        'inception4c': {
            'sign_count': tensor([0.98, 0.02]),  # fraction positive
            'magnitude': tensor([1.97, -1.97]),  # average sensitivity
        },
        ...
    },
}

sign_count. Fraction of inputs where concept increases prediction (TCAV score proper)

magnitude. Average directional sensitivity across inputs

sign_count[0] = 0.98 means “stripes” positively influences 98% of zebra predictions
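Pulling a score out of the nested structure is just dictionary indexing. A small helper sketch, with the scores hardcoded for illustration and plain lists standing in for Captum's tensors:

```python
# Hypothetical scores in the nested format above (lists instead of tensors)
tcav_scores = {
    "0-1": {
        "inception4c": {
            "sign_count": [0.98, 0.02],
            "magnitude": [1.97, -1.97],
        },
    },
}

def concept_tcav(scores, pair, layer, concept_idx=0):
    """Return the TCAV score (sign_count) for one concept of a pair."""
    return scores[pair][layer]["sign_count"][concept_idx]

print(concept_tcav(tcav_scores, "0-1", "inception4c"))  # → 0.98
```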

User Studies

Goals

Example Study

Experimental Design

Results

Interpretation

Exercise: Complementarity